March Madness Machine Learning Mania
2nd Place Kaggle Solution

Reference to any specific commercial product, process, or service, or the use of any trade, firm, or corporation name is for the information and convenience of the public and does not constitute endorsement, recommendation, or favoring by the Department of Defense or the United States Government.

Introduction

Who are we?

MAJ Dusty Turner

  • Education
    • United States Military Academy 07, Bachelor of Science in Operations Research
    • Missouri University of Science and Technology, Master of Science in Engineering Management
    • THE Ohio State University, Master of Science in Industrial and Systems Engineering
    • THE Ohio State University, Graduate Minor in Applied Statistics
  • Work
    • Schofield Barracks, Hawaii, Engineer Platoon Leader / Executive Officer / Assistant S3 (Operation Iraqi Freedom)
    • White Sands Missile Range, New Mexico, Engineer Assistant S3 / Commander (Operation Enduring Freedom)
    • West Point, New York, Assistant Professor
    • Fort Belvoir, Virginia, Operations Research Systems Analyst / Data Scientist

MAJ Jim Pleuss

  • Education
    • United States Military Academy 07, Bachelor of Science in Computer Science
    • Kansas State University, Master of Science in Operations Research
  • Work
    • Okinawa, Japan, Signal Battalion Platoon Leader / Executive Officer / Assistant S3
    • Fort Riley, Kansas, Battalion S6 / Commander (Operation Enduring Freedom)
    • West Point, New York, Assistant Professor / Assistant Dean for Plans, Analysis, and Personnel

Both of us

  • Starting PhD programs in Fall 2022 with follow-on assignments to the Math Department at USMA
  • Watch too much sports
  • Enjoy R
  • Kaggle history

Who are you?

Hopefully you are…

  • Someone really interested in basketball
  • Someone who really enjoys machine learning

What is Kaggle?

Kaggle is pronounced with a short ‘a’ and rhymes with ‘gaggle’, ‘haggle’, etc.

Kaggle hosts machine learning competitions ranging from image classification and text analysis to accuracy competitions and games. Monetary prizes are awarded for some competitions. Kagglers are ranked for their contributions to competitions, datasets, notebooks, and discussions.

Companies sponsor competitions to hire talent and to have their problems solved fairly cheaply by a community of workers.

What is Kaggle’s March Madness Machine Learning Mania Competition?

Each year, Kaggle hosts a National Collegiate Athletic Association (NCAA) Men’s tournament competition. Unlike a traditional March Madness pool, entrants submit a probability of victory for one team over the other for every possible combination of games that might be played, rather than a single bracket selection. This guarantees you have made a pick for every game.

Teams are scored in the following way:

\[-\frac{1}{n}\sum_{i=1}^{n} [ y_i \ln{(\hat{y}_i)} + (1-y_i) \ln{(1-\hat{y}_i)}]\]

where

  • \(n\) is the number of games played
  • \(\hat{y}_i\) is the predicted probability of team 1 beating team 2
  • \(y_i\) is 1 if team 1 wins, 0 if team 2 wins
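For intuition, this score is easy to compute directly. A minimal sketch in R (the clipping of extreme probabilities mirrors what our modeling function does later):

```r
# Log loss (binary cross-entropy) as scored by the competition
log_loss <- function(y, y_hat, eps = 0.001) {
  y_hat <- pmin(pmax(y_hat, eps), 1 - eps)  # clip so log() never sees 0 or 1
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# A confident correct pick is rewarded; a confident wrong pick is punished hard
log_loss(1, 0.9)  # ~0.105
log_loss(1, 0.5)  # ~0.693
log_loss(1, 0.1)  # ~2.303
```

Note the asymmetry: moving a correct pick from 0.5 to 0.9 saves about 0.59, while the same move on a wrong pick costs about 1.61, which is why overconfidence is dangerous.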

Example Submission

# A tibble: 6 x 2
  ID               Pred
  <chr>           <dbl>
1 2021_1101_1104 0.181 
2 2021_1101_1111 0.459 
3 2021_1101_1116 0.0883
4 2021_1101_1124 0.213 
5 2021_1101_1140 0.283 
6 2021_1101_1155 0.322 
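Generating the ID column is mechanical: every pair of tournament teams appears once, with the lower TeamID listed first. A sketch with a hypothetical four-team field:

```r
library(tidyverse)

team_ids <- c(1101, 1104, 1111, 1116)  # hypothetical subset; the real field has 68 teams

submission <- crossing(team_a = team_ids, team_b = team_ids) %>%
  filter(team_a < team_b) %>%                       # lower TeamID always comes first
  transmute(ID = str_c(2021, team_a, team_b, sep = "_"),
            Pred = 0.5)                             # placeholder probability

nrow(submission)  # choose(4, 2) = 6 matchups
```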

Visual of Scoring
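The plot itself is not reproduced here; it showed the penalty \(-\ln{(\hat{y}_i)}\) for a game the predicted team wins, which grows without bound as the predicted probability approaches 0. A sketch to regenerate the idea:

```r
library(tidyverse)

# Log loss contribution for a game that team 1 wins, across predicted probabilities
curve_data <- tibble(p = seq(0.01, 0.99, by = 0.01),
                     penalty = -log(p))

ggplot(curve_data, aes(x = p, y = penalty)) +
  geom_line() +
  labs(x = "Predicted probability for the eventual winner",
       y = "Log loss contribution")
```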

Basketball Philosophy

  1. Focus on how teams are playing at the end of the season. Games in November don’t mean a lot in March.
  2. March Madness is all about match-ups. Give the model the opportunity to find favorable match-ups by providing data specifying “types” of teams (e.g., slow paced teams, run and gun, highly efficient, 3-point shooting teams).
  3. Take advantage of the wisdom of crowds.

Data

Kaggle provides a lot of game and team-level data, but it is very raw and granular.

Ranking Data

We pulled each team’s end-of-season rankings from 40 different ranking agencies and used the average of these rankings to capture the “wisdom of the crowds”.

# A tibble: 6 x 8
  Season RankingDayNum TeamID   MOR   POM   RPI   SAG   WLK
   <dbl>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   2003           154   1102   132   160   158   149   165
2   2003           154   1103   139   163   182   172   172
3   2003           154   1104    26    33    38    37    36
4   2003           154   1105   309   307   313   312   310
5   2003           154   1106   294   263   248   268   254
6   2003           154   1107   316   312   294   307   309
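Turning the wide ranking table into a single average is a pivot and a grouped mean. A sketch using two rows of the table above (only five ranking systems are shown here; the real table has about 40):

```r
library(tidyverse)

rankings_wide <- tribble(
  ~Season, ~RankingDayNum, ~TeamID, ~MOR, ~POM, ~RPI, ~SAG, ~WLK,
     2003,            154,    1102,  132,  160,  158,  149,  165,
     2003,            154,    1104,   26,   33,   38,   37,   36
)

ranking_data <- rankings_wide %>%
  pivot_longer(cols = -c(Season, RankingDayNum, TeamID),
               names_to = "system", values_to = "rank") %>%
  group_by(Season, TeamID) %>%
  summarise(rank_avg = mean(rank, na.rm = TRUE), .groups = "drop")

ranking_data$rank_avg  # 152.8 for team 1102, 34.0 for team 1104
```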

End of Season Summary Statistics

We summarized each team’s statistics at the end of the regular season.

summary_stats <-
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>% 
  rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>% 
  rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
  mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>% 
  select(Season, DayNum, TeamID = ATeamID, AScore, Loc = ALoc, AFGM:APF, win, win_by) %>% 
  pivot_longer(cols = -c(Season,DayNum,TeamID,Loc)) %>% 
  bind_rows(

read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>% 
  rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>% 
  rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
  mutate(win = 0, win_by = BScore - AScore) %>% 
  select(Season, DayNum, TeamID = BTeamID, BScore, Loc = ALoc, BFGM:BPF, win, win_by) %>%
  mutate(Loc = case_when(Loc == "A" ~ "H",
                         Loc == "H" ~ "A",
                         TRUE ~ Loc)) %>% 
  pivot_longer(cols = -c(Season,DayNum,TeamID, Loc)) 

) %>% 
  mutate(name = str_sub(name, 2,-1)) %>% 
  group_by(Season, TeamID, name) %>% 
  summarise(summary_val = mean(value, na.rm = T)) %>% 
  pivot_wider(names_from = name, values_from = summary_val) %>% 
  ungroup() %>% 
  rename(avg_win = `in`, avg_win_by = `in_by`) %>% 
  relocate(c(avg_win,avg_win_by), .after = last_col())

summary_stats_final <-
  read_csv(here("01_data/MNCAATourneyDetailedResults.csv")) %>% 
  rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>% 
  rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
  mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>% 
  select(Season, DayNum, TeamID = ATeamID, AScore, AFGM:APF, win, win_by) %>% 
  pivot_longer(cols = -c(Season,DayNum,TeamID)) %>% 
  bind_rows(
    read_csv(here("01_data/MNCAATourneyDetailedResults.csv")) %>% 
      rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>% 
      rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
      mutate(win = 0, win_by = BScore - AScore) %>% 
      select(Season, DayNum, TeamID = BTeamID, BScore, BFGM:BPF, win, win_by) %>% 
      pivot_longer(cols = -c(Season,DayNum,TeamID))
    ) %>%
  mutate(name = str_sub(name, 2,-1)) %>% 
  group_by(Season, TeamID, name) %>%
  summarise(summary_val = mean(value, na.rm = T)) %>%  
  pivot_wider(names_from = name, values_from = summary_val) %>% 
  ungroup() %>% 
  rename(avg_win = `in`, avg_win_by = `in_by`) %>% 
  relocate(c(avg_win,avg_win_by), .after = last_col())

head(summary_stats_final)
# A tibble: 6 x 18
  Season TeamID   Ast   Blk    DR   FGA  FGA3   FGM  FGM3   FTA   FTM    OR
   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   2003   1104  13    6     20    52    12    22    5     16   13      9  
2   2003   1112  18.8  4.5   30.2  67.8  20.8  31    7.75  19.2 15     13.5
3   2003   1113  14.5  4.5   20.5  64    14.5  29.5  5     22.5 16     13.5
4   2003   1120  12.3  5.33  23.7  58.7  19.7  25.3  7     16.7 12.7   13.3
5   2003   1122  11    3     21    54    21    24    2     22   14      8  
6   2003   1139  12    1.33  17    47.3  17.7  21.3  7.67  14.7  9.67   8  
# ... with 6 more variables: PF <dbl>, Score <dbl>, Stl <dbl>, TO <dbl>,
#   avg_win <dbl>, avg_win_by <dbl>

External Efficiency Data

These data came from the team rankings website and are a snapshot of several advanced end-of-season statistics:

  • Offensive/Defensive efficiency
  • Possessions per game
  • Free throw rate
  • Strength of schedule
  • Percentage of points from 3-pointers
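For reference, offensive efficiency is points scored per possession, where possessions are usually estimated from box score statistics. A sketch using the common 0.475 free throw weight (the website may use a slightly different convention):

```r
# Estimate possessions from box score stats, then points per possession
possessions <- function(fga, orb, tov, fta, ft_weight = 0.475) {
  fga - orb + tov + ft_weight * fta  # shots not extended by an offensive rebound
}

offensive_efficiency <- function(points, poss) points / poss

# Hypothetical single-game box score
poss <- possessions(fga = 58, orb = 11, tov = 13, fta = 21)
round(offensive_efficiency(points = 72, poss), 3)  # ~1.029 points per possession
```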

Hierarchical Clustering

We then used hierarchical clustering to break every Division 1 team from the past 15 years into 10 clusters using many of the metrics above.

# A tibble: 6 x 15
  TeamID Season rank_avg    OE    DE clust last3offensive_ef~ last3possessions_~
   <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl>              <dbl>              <dbl>
1   1266   2003    12.8   1.19 1.01      1               1.12               73.1
2   1196   2003     9.82  1.13 0.96      1               1.08               68.4
3   1166   2003    20.7   1.12 0.916     2               1.03               68.9
4   1139   2003    41.1   1.12 0.984     3               1.06               60.6
5   1462   2003    15.7   1.11 0.927     2               1.21               70  
6   1458   2003    19.8   1.11 0.909     1               1.13               59.7
# ... with 7 more variables: last3defensive_efficiency <dbl>,
#   overallfree_throw_rate <dbl>, col <dbl>,
#   ratingschedule_strength_by_other <dbl>,
#   overallpercent_of_points_from_3_pointers <dbl>, last3_change <dbl>,
#   t3_week_rank_avg <dbl>
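The clustering step itself is compact. A minimal sketch with toy stand-in data (the real input was a team-season by metric table built from the statistics above):

```r
library(tidyverse)

set.seed(1)
team_metrics <- matrix(rnorm(20 * 5), nrow = 20)  # toy stand-in: 20 team-seasons, 5 metrics

clust <- team_metrics %>%
  scale() %>%                     # put metrics on a common scale
  dist() %>%                      # pairwise Euclidean distances
  hclust(method = "ward.D2") %>%  # agglomerative (hierarchical) clustering
  cutree(k = 10)                  # cut the dendrogram into 10 clusters

table(clust)  # cluster sizes; every team-season gets exactly one of 10 labels
```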

Quad 1 Wins and Quad 4 Losses

This gives each team’s number of wins against top teams and losses against bottom teams, based on the opponent’s average ordinal ranking at the end of the season.

quality_win_tracker <-
  read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
  rename_with(.fn = ~str_replace(string = ., pattern = "W", replacement = "A"), .cols = starts_with("W")) %>%
  rename_with(.fn = ~str_replace(string = ., pattern = "L", replacement = "B"), .cols = starts_with("L")) %>%
  mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>%
  select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by) %>%
  bind_rows(
    read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
      rename_with(.fn = ~str_replace(string = ., pattern = "W", replacement = "B"), .cols = starts_with("W")) %>%
      rename_with(.fn = ~str_replace(string = ., pattern = "L", replacement = "A"), .cols = starts_with("L")) %>%
      mutate(win = 0, win_by = BScore - AScore) %>%
      select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by)
  ) %>%
  arrange(Season, DayNum) %>%
  group_by(Season, ATeamID) %>%
  mutate(AGames = row_number(), AWins = cumsum(win), ALosses = AGames - AWins) %>%
  group_by(Season, BTeamID) %>%
  mutate(BGames = row_number(), BLosses = cumsum(win), BWins = BGames - BLosses) %>%
  ungroup() %>%
  mutate(win_perc_A = AWins / AGames, win_perc_B = BWins / BGames) %>%
  filter(AGames >= 10 & BGames >= 10) %>%
  group_by(Season, ATeamID) %>%
  summarise(good_wins = sum(win == 1 & win_perc_B > 0.6),
            bad_loss  = sum(win == 0 & win_perc_B < 0.3),
            wins      = max(AWins),
            losses    = max(ALosses)) %>%  # the A team's cumulative losses
  rename(TeamID = ATeamID) %>%
  ungroup()

quad_win_helper <-
  read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
  rename_with(.fn = ~str_replace(string = ., pattern = "W", replacement = "A"), .cols = starts_with("W")) %>%
  rename_with(.fn = ~str_replace(string = ., pattern = "L", replacement = "B"), .cols = starts_with("L")) %>%
  mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>%
  select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by) %>%
  bind_rows(
    read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
      rename_with(.fn = ~str_replace(string = ., pattern = "W", replacement = "B"), .cols = starts_with("W")) %>%
      rename_with(.fn = ~str_replace(string = ., pattern = "L", replacement = "A"), .cols = starts_with("L")) %>%
      mutate(win = 0, win_by = BScore - AScore) %>%
      select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by)
  ) %>%
  left_join(ranking_data %>% select(TeamID, Season, rank_avg_B = rank_avg),
            by = c("Season", BTeamID = "TeamID")) %>%
  mutate(game_of_interest_A = rank_avg_B < 30,
         game_of_interest_A_bad = rank_avg_B > 75)


quad_win_tracker <-
  quad_win_helper %>%
  count(Season, ATeamID, win, game_of_interest_A) %>%
  filter(!is.na(game_of_interest_A), game_of_interest_A, win == 1) %>%
  select(Season, TeamID = ATeamID, quad_wins = n) %>%
  full_join(
    quad_win_helper %>%
      count(Season, BTeamID, win, game_of_interest_A_bad) %>%
      filter(!is.na(game_of_interest_A_bad), game_of_interest_A_bad, win == 0) %>%
      select(Season, TeamID = BTeamID, quad_loss = n)
  ) %>%
  mutate(across(.cols = c(quad_wins, quad_loss), .fns = ~replace_na(., 0)))

head(quad_win_helper)
# A tibble: 6 x 11
  Season DayNum ATeamID AScore BTeamID BScore   win win_by rank_avg_B
   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <dbl>  <dbl>      <dbl>
1   2003     10    1104     68    1328     62     1      6       6.12
2   2003     10    1272     70    1393     63     1      7      13.1 
3   2003     11    1266     73    1437     61     1     12      74.4 
4   2003     11    1296     56    1457     50     1      6     192.  
5   2003     11    1400     77    1208     71     1      6      12.7 
6   2003     11    1458     81    1186     55     1     26     120.  
# ... with 2 more variables: game_of_interest_A <lgl>,
#   game_of_interest_A_bad <lgl>

Final Modeling Data

This compiles all the data for building the model in the next step. We used every tournament game since 2003 to build our model. Each factor had to be represented twice in each row, once for each team.

base_builder <-
  read_csv(here("01_data/MNCAATourneyDetailedResults.csv")) %>%
  select(Season, contains("Team"), contains("Score"), DayNum) %>%
  mutate(win_by = WScore - LScore) %>%
  relocate(-win_by) %>%
  mutate(win = if_else(win_by > 0, "win", "lose")) %>%
  select(Season, WTeamID, LTeamID, DayNum, win_by, win)

base_builder <-
  base_builder %>%
  bind_rows(
    base_builder %>%
      rename(LTeamID = WTeamID, WTeamID = LTeamID) %>%
      mutate(win_by = -win_by, win = "lose")
  )


staging_data <-
  ranking_data %>%
  left_join(summary_stats, by = c("Season", "TeamID")) %>%
  distinct() %>%
  left_join(quality_win_tracker) %>%
  left_join(read_csv(here("01_data/MTeamConferences.csv")), by = c("Season", "TeamID")) %>%
  relocate(ConfAbbrev, .after = Season) %>%
  rename(conf = ConfAbbrev)


model_data <-
  base_builder %>%
  left_join(staging_data, by = c(WTeamID = "TeamID", "Season")) %>%
  left_join(staging_data, by = c(LTeamID = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
  mutate(conf_record_one = str_c(conf_A, "_", conf_B),
         conf_record_two = str_c(conf_B, "_", conf_A)) %>%
  left_join(conf_rank, by = c("Season", conf_record_one = "conf_record")) %>%
  left_join(conf_rank, by = c("Season", conf_record_two = "conf_record"), suffix = c("_against_B", "_against_A")) %>%
  select(-c(conf_record_one, conf_record_two, conf_A, conf_B)) %>%
  mutate(across(.cols = contains("conf_"), .fns = ~replace_na(., 0))) %>%
  left_join(rankings, by = c(WTeamID = "TeamID", "Season")) %>%
  left_join(rankings, by = c(LTeamID = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
  left_join(seeds, by = c(WTeamID = "TeamID", "Season")) %>%
  left_join(seeds, by = c(LTeamID = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
  left_join(quad_win_tracker, by = c(WTeamID = "TeamID", "Season")) %>%
  left_join(quad_win_tracker, by = c(LTeamID = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
  group_split(Season < 2015) %>%
  set_names(c("Test", "Train")) %>%
  map(~select(., -`Season < 2015`))

head(model_data)
$Test
# A tibble: 670 x 94
   Season WTeamID LTeamID DayNum win_by win   rank_avg_A  OE_A  DE_A clust_A
    <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <chr>      <dbl> <dbl> <dbl>   <dbl>
 1   2015    1214    1264    134     10 win       253.   0.928 0.958       9
 2   2015    1279    1140    134      4 win        49.2  1.06  0.98        1
 3   2015    1173    1129    135      1 win        42.5  1.04  0.915       1
 4   2015    1352    1316    135      4 win       173.   0.998 0.981       5
 5   2015    1112    1411    136     21 win         4.21 1.12  0.857       1
 6   2015    1116    1459    136      3 win        22.3  1.08  0.964       2
 7   2015    1139    1400    136      8 win        24.5  1.04  0.91        1
 8   2015    1153    1345    136      1 win        45.8  1.01  0.891       5
 9   2015    1207    1186    136     10 win        23.2  1.05  0.958       1
10   2015    1209    1124    136      1 win        77.5  1.07  0.939       1
# ... with 660 more rows, and 84 more variables:
#   last3offensive_efficiency_A <dbl>, last3possessions_per_game_A <dbl>,
#   last3defensive_efficiency_A <dbl>, overallfree_throw_rate_A <dbl>,
#   col_A <dbl>, ratingschedule_strength_by_other_A <dbl>,
#   overallpercent_of_points_from_3_pointers_A <dbl>, last3_change_A <dbl>,
#   t3_week_rank_avg_A <dbl>, Ast_A <dbl>, Blk_A <dbl>, DR_A <dbl>,
#   FGA_A <dbl>, FGA3_A <dbl>, FGM_A <dbl>, FGM3_A <dbl>, FTA_A <dbl>,
#   FTM_A <dbl>, OR_A <dbl>, PF_A <dbl>, Score_A <dbl>, Stl_A <dbl>,
#   TO_A <dbl>, avg_win_A <dbl>, avg_win_by_A <dbl>, good_wins_A <int>,
#   bad_loss_A <int>, wins_A <dbl>, losses_A <dbl>, rank_avg_B <dbl>,
#   OE_B <dbl>, DE_B <dbl>, clust_B <dbl>, last3offensive_efficiency_B <dbl>,
#   last3possessions_per_game_B <dbl>, last3defensive_efficiency_B <dbl>,
#   overallfree_throw_rate_B <dbl>, col_B <dbl>,
#   ratingschedule_strength_by_other_B <dbl>,
#   overallpercent_of_points_from_3_pointers_B <dbl>, last3_change_B <dbl>,
#   t3_week_rank_avg_B <dbl>, Ast_B <dbl>, Blk_B <dbl>, DR_B <dbl>,
#   FGA_B <dbl>, FGA3_B <dbl>, FGM_B <dbl>, FGM3_B <dbl>, FTA_B <dbl>,
#   FTM_B <dbl>, OR_B <dbl>, PF_B <dbl>, Score_B <dbl>, Stl_B <dbl>,
#   TO_B <dbl>, avg_win_B <dbl>, avg_win_by_B <dbl>, good_wins_B <int>,
#   bad_loss_B <int>, wins_B <dbl>, losses_B <dbl>, conf_wins_against_B <dbl>,
#   conf_loss_against_B <dbl>, conf_wins_against_A <dbl>,
#   conf_loss_against_A <dbl>, RankingDayNum_A <dbl>, MOR_A <dbl>, POM_A <dbl>,
#   RPI_A <dbl>, SAG_A <dbl>, WLK_A <dbl>, RankingDayNum_B <dbl>, MOR_B <dbl>,
#   POM_B <dbl>, RPI_B <dbl>, SAG_B <dbl>, WLK_B <dbl>, Seed_A <dbl>,
#   Seed_B <dbl>, quad_wins_A <dbl>, quad_loss_A <dbl>, quad_wins_B <dbl>,
#   quad_loss_B <dbl>

$Train
# A tibble: 1,560 x 94
   Season WTeamID LTeamID DayNum win_by win   rank_avg_A  OE_A  DE_A clust_A
    <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <chr>      <dbl> <dbl> <dbl>   <dbl>
 1   2003    1421    1411    134      8 win       240.   0.986 1.09        8
 2   2003    1112    1436    136     29 win         2.68 1.09  0.891       2
 3   2003    1113    1272    136     13 win        36    1.06  0.962       5
 4   2003    1141    1166    136      6 win        45.7  1.06  0.994       5
 5   2003    1143    1301    136      2 win        36.4  1.03  0.968       1
 6   2003    1163    1140    136      5 win        27.5  1.07  0.946       2
 7   2003    1181    1161    136     10 win        10.7  1.08  0.928       2
 8   2003    1211    1153    136      5 win        43.1  1.08  0.953       1
 9   2003    1228    1443    136      5 win        10.5  1.08  0.877       2
10   2003    1242    1429    136      3 win         5.97 1.08  0.872       2
# ... with 1,550 more rows, and 84 more variables:
#   last3offensive_efficiency_A <dbl>, last3possessions_per_game_A <dbl>,
#   last3defensive_efficiency_A <dbl>, overallfree_throw_rate_A <dbl>,
#   col_A <dbl>, ratingschedule_strength_by_other_A <dbl>,
#   overallpercent_of_points_from_3_pointers_A <dbl>, last3_change_A <dbl>,
#   t3_week_rank_avg_A <dbl>, Ast_A <dbl>, Blk_A <dbl>, DR_A <dbl>,
#   FGA_A <dbl>, FGA3_A <dbl>, FGM_A <dbl>, FGM3_A <dbl>, FTA_A <dbl>,
#   FTM_A <dbl>, OR_A <dbl>, PF_A <dbl>, Score_A <dbl>, Stl_A <dbl>,
#   TO_A <dbl>, avg_win_A <dbl>, avg_win_by_A <dbl>, good_wins_A <int>,
#   bad_loss_A <int>, wins_A <dbl>, losses_A <dbl>, rank_avg_B <dbl>,
#   OE_B <dbl>, DE_B <dbl>, clust_B <dbl>, last3offensive_efficiency_B <dbl>,
#   last3possessions_per_game_B <dbl>, last3defensive_efficiency_B <dbl>,
#   overallfree_throw_rate_B <dbl>, col_B <dbl>,
#   ratingschedule_strength_by_other_B <dbl>,
#   overallpercent_of_points_from_3_pointers_B <dbl>, last3_change_B <dbl>,
#   t3_week_rank_avg_B <dbl>, Ast_B <dbl>, Blk_B <dbl>, DR_B <dbl>,
#   FGA_B <dbl>, FGA3_B <dbl>, FGM_B <dbl>, FGM3_B <dbl>, FTA_B <dbl>,
#   FTM_B <dbl>, OR_B <dbl>, PF_B <dbl>, Score_B <dbl>, Stl_B <dbl>,
#   TO_B <dbl>, avg_win_B <dbl>, avg_win_by_B <dbl>, good_wins_B <int>,
#   bad_loss_B <int>, wins_B <dbl>, losses_B <dbl>, conf_wins_against_B <dbl>,
#   conf_loss_against_B <dbl>, conf_wins_against_A <dbl>,
#   conf_loss_against_A <dbl>, RankingDayNum_A <dbl>, MOR_A <dbl>, POM_A <dbl>,
#   RPI_A <dbl>, SAG_A <dbl>, WLK_A <dbl>, RankingDayNum_B <dbl>, MOR_B <dbl>,
#   POM_B <dbl>, RPI_B <dbl>, SAG_B <dbl>, WLK_B <dbl>, Seed_A <dbl>,
#   Seed_B <dbl>, quad_wins_A <dbl>, quad_loss_A <dbl>, quad_wins_B <dbl>,
#   quad_loss_B <dbl>

Submission Preparation Data

This builds the data frame of factors for every possible pair of teams in the 2021 tournament, which we use to make predictions and submit to Kaggle.

# A tibble: 6 x 91
  Season WTeamID LTeamID rank_avg_A  OE_A  DE_A clust_A last3offensive_efficien~
   <dbl>   <dbl>   <dbl>      <dbl> <dbl> <dbl>   <dbl>                    <dbl>
1   2021    1101    1104       83.5  1.03  0.85       2                     1.03
2   2021    1101    1111       83.5  1.03  0.85       2                     1.03
3   2021    1101    1116       83.5  1.03  0.85       2                     1.03
4   2021    1101    1124       83.5  1.03  0.85       2                     1.03
5   2021    1101    1140       83.5  1.03  0.85       2                     1.03
6   2021    1101    1155       83.5  1.03  0.85       2                     1.03
# ... with 83 more variables: last3possessions_per_game_A <dbl>,
#   last3defensive_efficiency_A <dbl>, overallfree_throw_rate_A <dbl>,
#   col_A <dbl>, ratingschedule_strength_by_other_A <dbl>,
#   overallpercent_of_points_from_3_pointers_A <dbl>, last3_change_A <dbl>,
#   t3_week_rank_avg_A <dbl>, Ast_A <dbl>, Blk_A <dbl>, DR_A <dbl>,
#   FGA_A <dbl>, FGA3_A <dbl>, FGM_A <dbl>, FGM3_A <dbl>, FTA_A <dbl>,
#   FTM_A <dbl>, OR_A <dbl>, PF_A <dbl>, Score_A <dbl>, Stl_A <dbl>,
#   TO_A <dbl>, avg_win_A <dbl>, avg_win_by_A <dbl>, good_wins_A <int>,
#   bad_loss_A <int>, wins_A <dbl>, losses_A <dbl>, rank_avg_B <dbl>,
#   OE_B <dbl>, DE_B <dbl>, clust_B <dbl>, last3offensive_efficiency_B <dbl>,
#   last3possessions_per_game_B <dbl>, last3defensive_efficiency_B <dbl>,
#   overallfree_throw_rate_B <dbl>, col_B <dbl>,
#   ratingschedule_strength_by_other_B <dbl>,
#   overallpercent_of_points_from_3_pointers_B <dbl>, last3_change_B <dbl>,
#   t3_week_rank_avg_B <dbl>, Ast_B <dbl>, Blk_B <dbl>, DR_B <dbl>,
#   FGA_B <dbl>, FGA3_B <dbl>, FGM_B <dbl>, FGM3_B <dbl>, FTA_B <dbl>,
#   FTM_B <dbl>, OR_B <dbl>, PF_B <dbl>, Score_B <dbl>, Stl_B <dbl>,
#   TO_B <dbl>, avg_win_B <dbl>, avg_win_by_B <dbl>, good_wins_B <int>,
#   bad_loss_B <int>, wins_B <dbl>, losses_B <dbl>, conf_wins_against_B <dbl>,
#   conf_loss_against_B <dbl>, conf_wins_against_A <dbl>,
#   conf_loss_against_A <dbl>, RankingDayNum_A <dbl>, MOR_A <dbl>, POM_A <dbl>,
#   RPI_A <dbl>, SAG_A <dbl>, WLK_A <dbl>, RankingDayNum_B <dbl>, MOR_B <dbl>,
#   POM_B <dbl>, RPI_B <dbl>, SAG_B <dbl>, WLK_B <dbl>, Seed_A <dbl>,
#   Seed_B <dbl>, quad_wins_A <dbl>, quad_loss_A <dbl>, quad_wins_B <dbl>,
#   quad_loss_B <dbl>

Workflow

Version Control

We hosted all our code on GitHub and used Git to manage version control.

Organized Folder Structure

We managed our files in the following folder structure.

  • 01_data
    • 26 csvs
  • 02_scripts
    • data_prep.R
    • support_functions.R
    • execute.R
  • 03_submissions
    • submission files
  • model_registry.csv

We wanted to minimize the “clutter” of our working script, so we placed data cleaning and preparation code in data_prep.R and the support functions we developed for modeling in support_functions.R.

Functional Modeling

We created one main function that would do the following:

  • Split data into test/train
  • Create a classification or regression model
  • Use an elastic net, random forest, or boosted tree
  • Accept a user-defined set of factors to create the model
  • Tune the modeling parameters over a user-defined grid size
  • Output appropriate accuracy metrics (Root Mean Squared Error (RMSE) or accuracy)
  • Output a submission file to upload into the Kaggle competition
  • Save model performance to the model registry (more details below)

Track Models

We created a model registry to support our model building. When we execute a model, the function first checks whether this (often time-consuming) model has already been run.

The registry tracks the following information:

  • Model engine (random forest, elastic net, boosted tree)
  • Type of model (regression or classification)
  • Regression equation (win ~ factor 1 + factor 2 + ... + factor n)
  • Tuning grid size (for determining parameters)
  • Model performance
  • Performance metric (RMSE / Accuracy)
  • Special notes

Example output from registry:

# A tibble: 6 x 7
  type_of_model mode   regression_formula     grid_size notes   model_performan~
  <chr>         <chr>  <chr>                      <dbl> <chr>              <dbl>
1 random_forest class~ "~winrank_avg_A + ran~       100 valida~            0.671
2 random_forest class~ "~winrank_avg_A + ran~       100 valida~            0.562
3 random_forest class~ "~winrank_avg_A + ran~       100 valida~            0.526
4 random_forest class~ "~winrank_avg_A + ran~       100 valida~            0.532
5 random_forest class~ "~winrank_avg_A + ran~       100 valida~            0.598
6 random_forest class~ "~winrank_avg_A + ran~       100 valida~            0.500
  performance_measure
  <chr>              
1 kaggle             
2 kaggle             
3 kaggle             
4 kaggle             
5 kaggle             
6 kaggle             
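The repeat check itself is a simple row match against the registry. A sketch, assuming the registry columns shown above (the data here are made up):

```r
library(tidyverse)

registry <- tibble(type_of_model = "random_forest",
                   mode = "classification",
                   regression_formula = "win ~ rank_avg_A + rank_avg_B",
                   grid_size = 100)

candidate <- tibble(type_of_model = "random_forest",
                    mode = "classification",
                    regression_formula = "win ~ rank_avg_A + rank_avg_B",
                    grid_size = 100)

# Has this exact engine/mode/formula/grid combination already been run?
already_run <- nrow(semi_join(registry, candidate,
                              by = c("type_of_model", "mode",
                                     "regression_formula", "grid_size"))) > 0
already_run  # TRUE
```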

Methodology

  • Build model off historical tournaments
  • Use end of season data
  • Regular season games not used in training data

Functions

Important Libraries

These were the main libraries.

Set Up Code

This code specifies which library’s function to prefer when function names conflict.

Workhorse Modeling Function

This is our main function used for modeling.

Inputs

  • Training data
  • Mode (classification or regression)
  • Regression equation (win ~ factor 1 + factor 2 + ... + factor n)
  • Grid size
  • Prediction data
  • Notes
  • Submission mode

Outputs

  • Updated model registry
  • Accuracy/RMSE of the model
  • File to submit to Kaggle
register_make_execute_evaluate <- function(data = stats, 
                                           type_of_model = "random_forest",
                                           mode = "classification",
                                           # mode = "regression",
                                           regression_formula = formula(win ~ ARPI + ABPI + BRPI + BBPI),
                                           # regression_formula = formula(win_by ~ ARPI + ABPI + BRPI + BBPI),
                                           grid_size = 1,
                                           prediction_data = stats_final,
                                           notes = "current best model make into regression",
                                           submission_mode = F
                                           ){

current_model_data <- tibble(type_of_model = type_of_model, 
                             mode = mode, 
                             regression_formula = str_c(as.character(regression_formula),collapse = ""),
                             grid_size = grid_size,
                             notes = notes)


if(!submission_mode){
  reg <- register_model(current_model_data = current_model_data)
}
    
set.seed(123)
splits <- initial_split(data, strata = Season)

stats_other <- training(splits)
stats_test  <- testing(splits)

set.seed(234)
val_set <- validation_split(stats_other, 
                            strata = Season, 
                            prop = 0.80)

cores <- parallel::detectCores()

message(str_c("you are using ", cores, " cores"))

if(type_of_model == "random_forest") {
  mod <-
    rand_forest(mtry = tune(),min_n = tune(),trees = tune()) %>%
    set_engine("ranger", num.threads = cores, keep.inbag = TRUE)
} else if(type_of_model == "log_reg"){
  mod <-
    logistic_reg(penalty = tune(),mixture = tune()) %>%
    set_engine("glmnet", num.threads = cores, keep.inbag = TRUE)
} else if(type_of_model == "boost_tree"){
  mod <-
    boost_tree(mtry = tune(),trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = 10) %>%
    set_engine("xgboost", num.threads = cores, keep.inbag = TRUE)
}

if (mode == "regression") {
  mod <-
    mod %>%
    set_mode("regression")
} else if (mode == "classification") {
  mod <-
    mod %>%
    set_mode("classification")
} 
  
  recipe <- 
    recipe(regression_formula, data = stats_other) %>%
    step_interact(terms = ~ conf_wins_against_B /(conf_wins_against_B + conf_loss_against_B)) %>%
    step_interact(terms = ~ conf_wins_against_A / (conf_wins_against_A + conf_loss_against_A)) %>%
    step_nzv(all_predictors(), - all_outcomes()) %>% 
    step_knnimpute(all_predictors()) %>%
    # step_meanimpute(all_numeric(), - all_outcomes()) %>% 
    step_center(all_numeric(), - all_outcomes()) %>% 
    step_scale(all_numeric(), - all_outcomes())  
  
  workflow <- 
    workflow() %>% 
    add_model(mod) %>% 
    add_recipe(recipe)
  
  if(mode == "regression"){
    metrics <- c("rmse", "rsq", "ccc")
    metrics_id <- metric_set(rmse, rsq, ccc)
  } else if(mode == "classification"){
    metrics <- c("accuracy", "kap")
    metrics_id <- metric_set(accuracy, kap)
  }
  
  message(str_c("Begin CV to tune parameters with grid size of ", grid_size, " with ", metrics[1]), " on a ", mode, " model.")
  
  set.seed(345)
  res <- 
    workflow %>% 
    tune_grid(val_set,
              grid = grid_size,
              control = control_grid(save_pred = TRUE, verbose = T),
              metrics = metrics_id)
  
  message(str_c("Complete CV to tune parameters with grid size of ", grid_size))
  
  best <- 
    res %>% 
    select_best(metric = metrics[1])
  
  if(type_of_model == "random_forest"){
    last_mod <- 
      rand_forest(mtry = best$mtry, min_n = best$min_n, trees = best$trees) %>% 
      set_engine("ranger", num.threads = cores, keep.inbag=TRUE, importance = "impurity") 
  } else if(type_of_model == "log_reg"){
    last_mod <- 
      logistic_reg(penalty = best$penalty, mixture = best$mixture) %>% 
      set_engine("glmnet", num.threads = cores, keep.inbag=TRUE) 
  } else if(type_of_model == "boost_tree"){
    last_mod <- 
      boost_tree(mtry = best$mtry, trees = best$trees, min_n = best$min_n, tree_depth = best$tree_depth, learn_rate = best$learn_rate, loss_reduction = best$loss_reduction, sample_size = best$sample_size, stop_iter = 10) %>%
      set_engine("xgboost", num.threads = cores, keep.inbag=TRUE) 
  }
  
  if(mode == "regression"){
    
    last_mod <-  
      last_mod %>% 
      set_mode("regression")
  }else if(mode == "classification"){
    
    last_mod <-  
      last_mod %>% 
      set_mode("classification")
  }
  
  last_workflow <- 
    workflow %>% 
    update_model(last_mod)
  
  set.seed(345)
  last_fit <- 
    last_workflow %>% 
    last_fit(splits)
  
  message("Begin fitting final model on entire data set")
  
  final_model <- fit(last_workflow, data)
  
  if(mode == "regression"){
    
    message("Begin making predictions")
    
    data_with_predictions <-
      prediction_data %>% 
      bind_cols(predict(final_model, new_data = prediction_data)) 
    
    if(!submission_mode){
    
      message(str_c("Score Model"))
  
      model_performance <-
        data_with_predictions %>% 
        group_by(Season) %>% 
        rmse(truth = win_by, estimate = .pred) %>% 
        summarise(mean_RMSE = mean(.estimate))
      
      message(str_c("This model has a ", metrics[1], " of ", round(model_performance$mean_RMSE,2)))
      
    
      new_registry <-
        reg %>% 
        bind_rows(
          current_model_data %>% 
            mutate(model_performance = as.double(model_performance$mean_RMSE), performance_measure = metrics[1])
        ) 
      
    }
   
  }else if(mode == "classification"){
    
    message("Begin making predictions")
    
  
    data_with_predictions <-
      prediction_data %>% 
      bind_cols(predict(final_model, new_data = prediction_data, type = "prob")) %>% 
      bind_cols(predict(final_model, new_data = prediction_data)) 
    
    message("post set up data")
    
    if(!submission_mode){
      
    message(str_c("Score Model"))
    
    model_performance <-
      data_with_predictions %>% 
      # clip predictions into [0.001, 0.999] to avoid an infinite log loss
      mutate(across(.cols = c(.pred_lose, .pred_win), .fns = ~if_else(. > .999, .999, .))) %>% 
      mutate(across(.cols = c(.pred_lose, .pred_win), .fns = ~if_else(. < .001, .001, .))) %>% 
      mutate(kaggle = if_else(win == "win", -log(.pred_win), -log(.pred_lose))) %>% 
      group_by(Season) %>% 
      summarise(kaggle = mean(kaggle), accuracy = sum(.pred_class == win) / n()) %>% 
      summarise(mean_accuracy = mean(accuracy), mean_kaggle = mean(kaggle))
  
    message(str_c("This model has a ", metrics[1], " of ", round(model_performance$mean_accuracy,2), " and Kaggle score of ", round(model_performance$mean_kaggle,6)))
  
    new_registry <-
      reg %>% 
      bind_rows(
        current_model_data %>% 
          mutate(model_performance = as.double(model_performance$mean_kaggle), performance_measure = "Kaggle")
      ) 
    }
      
  }

  if(!submission_mode){
    write_csv(new_registry, "model_registry.csv")
  } else {
    if(mode=="regression"){
      data_with_predictions <-
        data_with_predictions %>%
        unite(col = ID, sep = "_", Season, WTeamID, LTeamID) %>%
        select(ID, Pred = contains("pred"))
    } else if(mode == "classification"){
      data_with_predictions <-
        data_with_predictions %>%
        unite(col = ID, sep = "_", Season, WTeamID, LTeamID) %>%
        select(ID, Pred = .pred_win)
    }
    id <-
      tibble(files = list.files("03_submissions")) %>% 
      mutate(files = str_remove(files, ".csv")) %>% 
      separate(files, sep = "_", c("trash","trash2","trash3","date","id")) %>% 
      select(date, id) %>% 
      mutate(date = lubridate::ymd(date), id = as.integer(id)) %>% 
      filter(date >= lubridate::today()) %>% 
      summarise(id = max(id)) %>% 
      mutate(id = ifelse(id==-Inf,1,id + 1)) %>% pull(id)
    
    write_csv(data_with_predictions, str_c("03_submissions/prediction_data_", mode,"_", lubridate::today(),"_",id, ".csv"))
  } 

 return(data_with_predictions)
}
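
The competition metric computed inside the function above is a clipped log loss. In isolation it looks like the following (a minimal sketch mirroring the clipping and `-log()` steps of the classification branch; the function name is ours, not from the code above):

```r
# The Kaggle metric as computed above: clip predictions to [0.001, 0.999],
# then take the mean negative log of the probability given to the true outcome.
kaggle_log_loss <- function(p_win, won) {
  p <- pmin(pmax(p_win, 0.001), 0.999)         # same clipping as the if_else() pair
  mean(ifelse(won, -log(p), -log(1 - p)))
}

kaggle_log_loss(c(0.9, 0.2), c(TRUE, FALSE))   # ~0.164
```

The clipping matters because a single confident-but-wrong prediction of exactly 0 or 1 would otherwise contribute an infinite penalty to the score.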

Execution

We load the support functions and data before beginning our modeling.

Parallel Processing

We ‘register’ the cores on our computer to support parallel processing during model building, which greatly increases the speed of our code.

[1] 8
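
The registration step is roughly the following (a sketch assuming the doParallel package; our actual setup lives in the support functions loaded earlier):

```r
# Detect and register cores for parallel model tuning.
# 'cores' is also what the engines receive (e.g., set_engine(..., num.threads = cores)).
library(doParallel)

cores <- parallel::detectCores()
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)

cores   # the [1] 8 printed above is this value on our machine
```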

Model Estimates

As a quick sanity check before predicting, we print the model inputs for a few teams (tibble output below; the console truncates some column names):

# A tibble: 2 x 10
  TeamName rank_avg_A t3_week_rank_avg~ last3offensive_effi~ overallfree_throw_~
  <chr>         <dbl>             <dbl>                <dbl>               <dbl>
1 Ohio           83.4             113.                  1.09                21.6
2 Virginia       17.8              22.6                 1.04                17.1
  last3defensive_effic~ ratingschedule_streng~ last3possessions_pe~ Seed_A col_A
                  <dbl>                  <dbl>                <dbl>  <dbl> <dbl>
1                 1.01                     0.1                 74.2     13    79
2                 0.964                   10.2                 61.2      4    20

# A tibble: 4 x 10
  TeamName   rank_avg_A t3_week_rank_avg~ last3offensive_eff~ overallfree_throw~
  <chr>           <dbl>             <dbl>               <dbl>              <dbl>
1 Arkansas         13.3              13.8               1.14                24.2
2 Florida          41.5              29.2               0.975               23.7
3 Ohio St          10.7              10.8               0.949               27.6
4 Oral Robe~      160.              178.                1.08                20.9
  last3defensive_effic~ ratingschedule_streng~ last3possessions_pe~ Seed_A col_A
                  <dbl>                  <dbl>                <dbl>  <dbl> <dbl>
1                  0.96                    9.2                 79.1      3    11
2                  1.02                   10.7                 66.6      7    50
3                  1.07                   12.7                 67.4      2    10
4                  1.00                   -1.8                 74.6     15   181

Final Thoughts

Challenges

  • Interactions of factors
  • Coding the regression equation was clunky
  • “Team A” vs. “Team B”
  • Unbalanced output: swapping the team order does not give complementary probabilities, i.e., P(A beats B) != 1 - P(B beats A)
  • Four days “flash to bang”
  • Day jobs - we had to do real work!
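
The “Team A” vs. “Team B” imbalance above has a common post-hoc workaround, which we note for completeness (a sketch, not something we used in the competition): score each matchup twice, once with each ordering, and average so that the two probabilities become complementary by construction.

```r
# Hypothetical post-hoc fix for non-complementary predictions.
# p_ab = P(A beats B) predicted with A in the "Team A" slot;
# p_ba = P(B beats A) predicted with B in the "Team A" slot.
symmetrize <- function(p_ab, p_ba) {
  (p_ab + (1 - p_ba)) / 2
}

symmetrize(0.70, 0.26)  # 0.72; the swapped ordering now yields exactly 0.28
```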

Ideas for Next Year

  • Explore how to incorporate regular season data (or at least conference championships) – We are throwing too much data away
  • Functionalize automation of variable selection
  • Explore team structure (strong front court vs. back court), star player, etc.
  • Network science application

What Didn’t Work

  • Bayesian model too computationally intensive
  • AutoML

What Helped us be Successful

  • Upsets (and the right ones)
  • Git
  • Functions
  • Model registry
  • Tidymodels

ESPN

Takeaways

  • This is for fun - and we had fun
  • Our workflow can be used broadly (we first built it for COVID modeling)
  • We should be ready to pick up where we left off next year